Web Scraping

First, we load environment variables. This is not required for non-professional projects, but it is good practice to keep private values such as database credentials out of the code, since exposing them could make your server vulnerable.
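As a minimal sketch of this step, the snippet below reads database settings from environment variables using only the standard library. The variable names (`DB_HOST`, `DB_PORT`, and so on) and the `load_db_config` helper are assumptions made for illustration, not names taken from the project.

```python
import os


def load_db_config(env=os.environ):
    """Collect database settings from a mapping of environment variables.

    The variable names below are hypothetical; adjust them to your setup.
    """
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "user": env.get("DB_USER", ""),
        "password": env.get("DB_PASSWORD", ""),
        "dbname": env.get("DB_NAME", ""),
    }


# Passing a plain dict makes the behaviour easy to check without touching
# the real environment; in the project you would call load_db_config().
config = load_db_config({"DB_HOST": "db.example.com", "DB_PORT": "5433"})
```

Keeping the credentials in the environment means the code can be shared without exposing them.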

We define our functions and import the libraries that will be used in the next steps.

We create the tables that will hold our data, along with the database functions.

We get the last execution dates from the tables so that, for performance, we do not import data we have already collected.
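The idea of skipping already collected rows can be sketched as follows. To keep the example self-contained it uses an in-memory SQLite database instead of the project's own database, and the `reviews` table, its columns, and both helper functions are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS reviews (
           restaurant TEXT,
           comment    TEXT,
           score      INTEGER,
           scraped_at TEXT   -- ISO date string, e.g. '2021-01-31'
       )"""
)


def last_execution_date(conn, table):
    """Return the newest scraped_at value in the table, or None if it is empty."""
    # The table name comes from our own code, never from user input.
    row = conn.execute(f"SELECT MAX(scraped_at) FROM {table}").fetchone()
    return row[0]


def insert_new_reviews(conn, rows):
    """Insert only rows newer than the last execution date; return how many."""
    last = last_execution_date(conn, "reviews")
    # ISO date strings sort chronologically, so string comparison is enough.
    fresh = [r for r in rows if last is None or r[3] > last]
    conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", fresh)
    return len(fresh)
```

On the first run every row is inserted. On later runs only rows dated after the stored maximum are added.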

We scrape the website to gather the restaurant list and the restaurants' links. In this part we need to scroll down to load all the restaurants.
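One common way to load a lazily populated page is to keep scrolling until the page height stops growing. The sketch below assumes `driver` is a Selenium WebDriver (or any object with a compatible `execute_script` method). The `scroll_to_bottom` helper, the pause length, and the round limit are illustrative assumptions.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops changing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to append new restaurants
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the end
        last_height = new_height
    return rounds
```

The `max_rounds` limit guards against pages that never stop growing.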

We loop through all of the restaurants and scrape their products.

We loop through all of the restaurants and scrape their reviews and the scores given with each comment.

We loop through all of the restaurants and scrape their product images for deep learning purposes.

Data Preprocessing

We clean the review dataset by lowercasing all the text, removing punctuation, removing some Turkish stopwords, and estimating when the reviews were made. We also convert the string scores to the integer type.
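A minimal sketch of the text cleaning and score conversion is shown below. The stopword set is a small illustrative sample rather than the list used in the project, and the function names are assumptions. The date estimation step is left out because its rules are not described here.

```python
import string

# A tiny illustrative stopword set; the real list is assumed to be longer.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "çok", "de", "da"}

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def turkish_lower(text):
    """Lowercase text while respecting the Turkish dotted and dotless i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def clean_review(text):
    """Lowercase, strip punctuation, and drop stopwords from a review."""
    text = turkish_lower(text).translate(_PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in TURKISH_STOPWORDS)


def parse_score(raw):
    """Convert a score string such as '10' to int; return None if not numeric."""
    raw = raw.strip()
    return int(raw) if raw.isdigit() else None
```

Handling `I` and `İ` before calling `lower()` matters for Turkish text, because Python's default lowercasing follows the non-Turkish rules for these letters.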

Exploratory Data Analysis

We have 1,189 distinct restaurants and 131,131 distinct products, with 1,077 distinct categories and 23,109 distinct names. Only 11% of the products are discounted.

Product prices average around 41, with a standard deviation of 30.4. We have around 5,112 product images.

In the user reviews we have 271,843 comments, of which 221,312 are distinct. Among the user scores, the most common score is 10, with around 56-64% of the scores, and the second most common is 1, with around 10-13% of the scores.

The dataset does not contain any missing data except in the user speed scores, where 3,495 rows (less than 1% of the data) are missing.

Machine Learning

Bag of Words with CountVectorizer

We transform our reviews with CountVectorizer. The process is similar to one-hot encoding, except that it counts how many times each word appears in a review.
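To make the counting idea concrete, here is a small pure-Python sketch of a bag-of-words transform. It mirrors the concept behind CountVectorizer but is not scikit-learn's implementation. The helper names and the sample reviews are made up for illustration.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map every distinct word in the corpus to a column index."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}


def count_vectorize(docs, vocab):
    """Turn each document into a row of word counts, one column per vocabulary word."""
    rows = []
    for d in docs:
        counts = Counter(d.split())
        row = [0] * len(vocab)
        for word, n in counts.items():
            if word in vocab:  # words unseen during fitting are ignored
                row[vocab[word]] = n
        rows.append(row)
    return rows


reviews = ["pizza sıcak geldi", "pizza soğuk pizza kötü"]
vocab = fit_vocabulary(reviews)
matrix = count_vectorize(reviews, vocab)
```

Each row of `matrix` has one entry per vocabulary word. Unlike one-hot encoding, an entry can be larger than 1 when a word repeats.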

Gaussian Naive Bayes

Support Vector Machines

C-Support Vector Classification

Linear Support Vector Classification

Decision Tree Classifier

Logistic Regression

Bag of Words with TF-IDF Vectorizer

We transform our reviews with TF-IDF Vectorizer. The process is similar to one-hot encoding, except that it weights each word by how frequently it appears in a review, scaled down for words that are common across all reviews.
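The weighting can be sketched in pure Python with the textbook formula, term frequency times log(N / document frequency). This is an illustration of the idea only. Library vectorizers typically add smoothing and normalization, so their numbers will differ. The `tfidf` helper is a hypothetical name.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for tokens in tokenized for w in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return weights


sample_weights = tfidf(["pizza sicak", "pizza soguk"])
```

In this sample, "pizza" appears in every document, so its weight is zero, while the words unique to one review keep a positive weight.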

Gaussian Naive Bayes

Support Vector Machines

C-Support Vector Classification

Linear Support Vector Classification

Decision Tree Classifier

Logistic Regression

Segmentation

K-Means Clustering

Gaussian Mixture Models

Markov Chains

Markov chains calculate how frequently words follow each other and generate new text from the first words they are given. These models can be used to create fake reviews.
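A first-order word-level chain can be sketched in a few lines. The function names and the sample reviews below are assumptions for illustration. A real model would be built from the full review dataset and might condition on more than one previous word.

```python
import random
from collections import defaultdict


def build_chain(texts):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for current, following in zip(words, words[1:]):
            # Repeated entries keep the observed frequencies.
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, rng=None):
    """Generate text by repeatedly sampling a follower of the last word."""
    rng = rng or random.Random()
    out = [start]
    while len(out) < max_words and out[-1] in chain:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)


chain = build_chain(["pizza çok lezzetli", "pizza çok sıcak"])
sentence = generate(chain, "pizza", rng=random.Random(0))
```

Because followers are stored with repetition, `rng.choice` picks each next word in proportion to how often it followed the current one.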

Convolutional Neural Network

Convolutional neural networks are used for image classification. For our labels, we check whether a product's name, description, or category contains the word "pizza".
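The labeling rule can be sketched as a small function that returns 1 when any of the three fields mentions pizza and 0 otherwise. The `is_pizza` name and the sample products are hypothetical. The network itself is not shown here.

```python
def is_pizza(name, description="", category=""):
    """Return 1 if any field contains the word 'pizza', else 0."""
    for field in (name, description, category):
        # Map the Turkish dotted capital İ to a plain i before lowercasing,
        # so that 'PİZZA' is matched as well; None is treated as empty text.
        text = (field or "").replace("İ", "i").lower()
        if "pizza" in text:
            return 1
    return 0


labels = [
    is_pizza("Margherita", "domates, mozzarella", "Pizzalar"),
    is_pizza("Lahmacun", "ince hamur", "Pide"),
]
```

A substring check also matches inflected forms such as "Pizzalar", which suits a simple binary label.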

Sources

  1. Selenium Documentation
  2. Selenium choose element by partial id
  3. Selenium scroll down to end of the page
  4. Selenium click button
  5. Markov Chains
  6. Bag of Words
  7. Turkish Porter Stemmer
  8. Scikit-Learn: Save & Restore Models
  9. Joblib vs Pickle
  10. Hide warnings in Python
  11. Slow scrolling down the page using selenium
  12. Download images using urllib
  13. Saving an image to Postgres
  14. Convolutional Neural Networks
  15. Combine several images
  16. Postgresql keeping only alphanumeric characters